feat: data overlay files — model, feature flag, and write/commit path#7407
feat: data overlay files — model, feature flag, and write/commit path#7407wjones127 wants to merge 9 commits into
Conversation
Add a specification for data overlay files: small files attached to a fragment that supply new values for a subset of (row offset, field) cells without rewriting the base data files, for cheap cell-level updates. - protos/table.proto: rework DataOverlayFile with a dense/sparse coverage oneof (shared_offset_bitmap vs new FieldCoverage), rename read_version to committed_version (effective, commit-stamped), and document rank-based addressing with no offset column. Document reader feature flag 64. - docs: add data_overlay_file.md (full spec, worked example, guidance stub) and link it from the table format overview. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the `DataOverlay` operation (and `DataOverlayGroup`) to attach overlay files to fragments without rewriting their base data. Mirrors the `DataReplacement` batch shape, appends to each fragment's `overlays` list, and documents permissive conflict semantics: concurrent overlays, appends, deletes, and column rewrites are compatible; row-rewrites, compaction, and overlay->base folds conflict. committed_version is left 0 by the writer and stamped at commit time. Proto only — Rust/Python bindings deferred. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The table/transaction proto changes generate new fields and an Operation variant. This wires the minimum needed to compile without implementing overlay support: - Emit empty `overlays` when converting fragments to proto. - Reject the `DataOverlay` transaction operation with NotSupported on read. Datasets that use overlays set reader feature flag 64, which already falls in the unknown-flag range rejected by `can_read_dataset`, so the library refuses them at the feature-flag layer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Important This PR touches the Lance format specification. Substantive changes to the format specification — the If this is a meaningful format change:
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
| /// | ||
| /// For a dense overlay the same shared bitmap is returned for every field; | ||
| /// for a sparse overlay the per-field bitmap at `field_pos` is returned. | ||
| pub fn coverage_for_field(&self, field_pos: usize) -> Result<RoaringBitmap> { |
There was a problem hiding this comment.
issue(blocking): this deserialized the bitmap on each invocation. Can we deserialize the bitmaps on load instead?
| return Err(Error::invalid_input( | ||
| "DataOverlayFile is missing its coverage", | ||
| )); |
There was a problem hiding this comment.
suggestion: add a test for this branch.
| pub const FLAG_UNKNOWN: u64 = 64; | ||
| pub const FLAG_UNKNOWN: u64 = 128; |
There was a problem hiding this comment.
suggestion: since we don't support all operations yet, I'm wondering if what we should do is make this default to 128 on debug builds, 64 on release builds, and maybe have an environment variable that let's you override in release builds for benchmarking. That way, we aren't releasing this into the wild just yet, but can test it.
| #[derive(Debug, Clone)] | ||
| pub struct ResolvedFieldOverlay { | ||
| /// Physical offsets this overlay covers for the field. | ||
| pub coverage: RoaringBitmap, |
There was a problem hiding this comment.
suggestion: this is immutable, right? Should we put this behind an Arc for sharing?
| /// Precedence is by `committed_version` (higher is newer); ties are broken by | ||
| /// position in the fragment's `overlays` list, where a later entry is newer. | ||
| /// Returns indices into `overlays`. |
There was a problem hiding this comment.
suggestion: if position in the list is already load bearing, why don't we just assume it's already in version order. On load, we can validate this, and do a one-time sort if needed. Then we can assume it's sorted in all further operations.
The v2 file writer advanced every column from a single global row counter, so a single file could only hold columns of equal length. Sparse data overlay files need columns whose item counts differ within one file (each field covers a different set of rows). Add `FileWriter::write_columns`, which writes a set of `(field, array)` pairs and advances each field's row counter independently, leaving other fields untouched. A field never written ends up as a zero-length column. `write_batch` is unchanged: it still advances all fields together, so ordinary rectangular files round-trip exactly as before. Per-column lengths were already derivable from page metadata; expose them via `FileReader::column_num_rows`. The reader already schedules each column from its own pages, so reading a column at its own length and random access within it work without further changes. Part of the Data Overlay Files feature (OSS-1323). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review on the unequal-length-columns work: - Rename `FileWriter::write_columns(Vec<(usize, ArrayRef)>)` to the simpler `write_column(column_index, array)`; callers loop instead of batching. - Drop doc comments from internal-only helpers. - Reader: validate that a projection's top-level columns share a length and reject mismatches up front with a descriptive error, instead of panicking in the decoder. `RangeFull`/`RangeFrom` now resolve to the projected common length rather than the file's longest column, so a single short column reads back at its own length. Rectangular files are unchanged. Adds `test_read_unequal_length_projection` (V2_0 + V2_1).
Adds the in-memory + commit machinery for data overlay files (per the spec in lance-format#7381), the foundation the scanner/take/index/compaction work builds on. - `DataOverlayFile` / `OverlayCoverage` (dense `shared_offset_bitmap` and sparse per-field) with protobuf round-trip, attached to `Fragment.overlays`. - Reader feature flag 64 (`FLAG_DATA_OVERLAY_FILES`): set whenever any fragment carries overlays, so a reader that does not understand them refuses the dataset instead of returning stale base values. - `Operation::DataOverlay` transaction op: appends overlays to a fragment's list (preserving concurrently-written overlays) and stamps each overlay's `committed_version` to the new dataset version at commit time (re-stamped on retry). Conflict rules mirror DataReplacement — permissive against appends, deletes, column rewrites, index builds, and other overlays; conflicts only with row-rewriting compaction of the same fragment. Scan-side merge, take, and end-to-end write+read tests follow in the same PR branch. Part of the Data Overlay Files feature (OSS-1322). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Now that overlays can be committed, a scan or take over a fragment that has overlays would silently return stale base values, since the read-path merge is not implemented yet. Refuse such reads at `FileFragment::open` with a clear error instead of serving incorrect data. Lifted once the scan/take merge lands (rest of OSS-1322 / OSS-1324). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…flag OverlayCoverage now holds parsed Arc<RoaringBitmap>s, deserialized once when a fragment is loaded and shared cheaply on clone, instead of re-deserializing the serialized bytes on every coverage_for_field call. Treat the data overlay feature flag (64) as unknown in release builds unless LANCE_ENABLE_DATA_OVERLAY_FILES is set, so the unreleased feature is neither emitted by release writers nor accepted by release readers; debug builds understand it so tests exercise the path. Stable-sort a fragment's overlays by committed_version (newest last) on load so resolution can assume the ordering. Adds tests for the missing-coverage/data_file errors, the release gating, and the load-time sort. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cdcd8e3 to
a1157a9
Compare
Three correctness fixes in the data-overlay write/commit path, each with tests (this path had ~0% coverage): - build_manifest dropped overlays when two DataOverlayGroups targeted the same fragment (HashMap collect kept only the last). Merge groups by fragment_id so all overlays are appended in order. - check_data_overlay_txn did not conflict when a concurrent Update/Delete removed an overlaid fragment, leaving the overlay orphaned and hard-erroring on retry. Mirror check_data_replacement_txn: retryable conflict when removed/deleted_fragment_ids intersect our fragments; fragments merely updated in place stay compatible. - impl PartialEq for Operation had the DataOverlay arm backwards: it reported DataOverlay == Rewrite as true and DataOverlay == DataOverlay as false (conflict semantics copied into the equality impl). Compare groups for the same variant; not equal to other variants. Tests: build_manifest append/stamp, duplicate-group merge, unknown-fragment error; the conflict matrix incl. remove-fragment conflict; Operation equality; OverlayCoverage serde JSON round-trip; coverage_for_field out-of-bounds; apply_feature_flags overlay arm. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Data model, feature flag, and write/commit foundation for Data Overlay Files (OSS-1322).
Stacked on #7381 (spec/proto) and #7406 (OSS-1323 — its
FileWriter::write_columnsis needed for the sparse overlay sink). Both stacks' commits show in this diff until they merge.What's here
DataOverlayFile+OverlayCoverageonFragment.overlays, dense (shared_offset_bitmap) and sparse (per-field). Coverage bitmaps are parsed once on load intoArc<RoaringBitmap>(not re-deserialized per access), and a fragment's overlays are stable-sorted bycommitted_version(newest last) on load so resolution can rely on the ordering. Protobuf round-trip +coverage_for_fieldtested.FLAG_DATA_OVERLAY_FILES) — set when any fragment has overlays. Release-gated: treated as an unknown flag in release builds (so release readers/writers refuse overlay datasets) unlessLANCE_ENABLE_DATA_OVERLAY_FILESis set; debug builds understand it. Tested.Operation::DataOverlaytransaction — appends overlays to a fragment (preserving concurrently-written ones) and stampscommitted_versionat commit, re-stamped on retry. Protobuf round-trip tested.CommitConflictResolver) — permissive likeDataReplacement: compatible with append / column-rewrite / index-build / other overlays, and with deletes/updates that leave the overlaid fragment in place; conflicts when a concurrent op removes or row-rewrites an overlaid fragment (Update/Deleteremoval,Rewrite/Merge) and with whole-datasetOverwrite/Restore. Tested (matrix + remove-fragment conflict).The fragment read path refuses overlays until the scan merge lands.
Follow-ups on this stack
DataOverlayFileWriter— streaming dense+sparse sink, used internally behindupdate/merge_insert.validate()overlay checks — value-column length vs. coverage cardinality, dtype vs. schema (I/O-bound; cheap structural invariants are enforced at parse/commit time).Blocks OSS-1324 (take/scan), OSS-1325 (index masking), OSS-1326 (compaction).
🤖 Generated with Claude Code